
3.7.3 Bayesian Pruning

After binarizing CNNs, we prune the resulting 1-bit CNNs under the same Bayesian learning framework. Different channels may follow similar distributions, and channels with similar distributions are combined for pruning. From a mathematical perspective, we obtain a Bayesian formulation of BNN pruning by directly extending our basic idea in [78], which systematically yields compact 1-bit CNNs. We represent the kernel weights of the $l$-th layer, $K^l$, as a tensor in $\mathbb{R}^{C^l_o \times C^l_i \times H^l \times W^l}$, where $C^l_o$ and $C^l_i$ denote the numbers of output and input channels, respectively, and $H^l$ and $W^l$ are the height and width of the kernels, respectively. For clarity, we define

$$K^l = [K^l_1, K^l_2, \ldots, K^l_{C^l_o}], \qquad (3.104)$$

where $K^l_i$, $i = 1, 2, \ldots, C^l_o$, is a 3-dimensional filter in $\mathbb{R}^{C^l_i \times H^l \times W^l}$. For simplicity, $l$ is omitted in the remainder of this section. To prune 1-bit CNNs, we assimilate similar filters into a single one through a controlled learning process. To do this, we first divide $K$ into groups using the K-means algorithm and then replace the filters of each group by their average during optimization. This process assumes that the filters $K_i$ within a group follow the same Gaussian distribution during training. The pruning problem then becomes finding the average $\bar{K}$ that replaces all the $K_i$ drawn from that distribution, which leads to a problem similar to Eq. 3.99. It should be noted that learning under a Gaussian distribution constraint is widely considered in the literature [82].
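As an illustration of this grouping step, the sketch below clusters the filters of one layer with K-means and computes each group's mean filter $\bar{K}$. It is a minimal sketch only, assuming PyTorch tensors and scikit-learn's KMeans; the function name `group_filters` and its interface are illustrative rather than the implementation used in BONN.

```python
import torch
from sklearn.cluster import KMeans

def group_filters(K, num_groups):
    """Cluster the C_o filters of one layer and compute each group's mean.

    K:          tensor of shape (C_o, C_i, H, W), the layer's kernel weights.
    num_groups: number of filter groups kept after pruning.
    Returns (labels, means): a group index per filter and the mean filter
    of each group, with means of shape (num_groups, C_i, H, W).
    """
    c_out = K.shape[0]
    flat = K.reshape(c_out, -1).detach().cpu().numpy()       # one row per filter
    labels = KMeans(n_clusters=num_groups, n_init=10).fit_predict(flat)
    labels = torch.as_tensor(labels, dtype=torch.long)
    # Replace every filter in a group by the group average (the mean filter above).
    means = torch.stack([K[labels == g].mean(dim=0) for g in range(num_groups)])
    return labels, means
```

During optimization, the filters assigned to a group would be driven toward, and finally replaced by, the corresponding row of `means`.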

Accordingly, Bayesian learning is used to prune 1-bit CNNs. We denote by $\epsilon$ the difference between a filter and its mean, i.e., $\epsilon = K - \bar{K}$, which follows a Gaussian distribution for simplicity. To calculate $\bar{K}$, we minimize $\epsilon$ based on MAP in our Bayesian framework, and we have

$$\bar{K} = \arg\max_{\bar{K}} p(\bar{K} \mid \epsilon) = \arg\max_{\bar{K}} p(\epsilon \mid \bar{K})\, p(\bar{K}), \qquad (3.105)$$

$$p(\epsilon \mid \bar{K}) \propto \exp\!\left(-\frac{1}{2\nu}\|\epsilon\|_2^2\right) = \exp\!\left(-\frac{1}{2\nu}\|K - \bar{K}\|_2^2\right), \qquad (3.106)$$

and $p(\bar{K})$ is similar to Eq. 3.101 but with one mode. Thus, we have

$$\min\ \|K - \bar{K}\|_2^2 + \nu\,(K - \bar{K})^T \Psi^{-1} (K - \bar{K}) + \nu \log\!\left(\det(\Psi)\right), \qquad (3.107)$$

which is called the Bayesian pruning loss. In summary, our Bayesian pruning addresses the problem in a general way: similar kernels are assumed to follow a Gaussian distribution and are finally represented by their centers for pruning. From this viewpoint, we obtain a pruning method that is better suited to binary neural networks than existing ones. Moreover, we take the latent distributions of kernel weights, features, and filters into consideration within the same framework and introduce Bayesian losses and Bayesian pruning to improve the capacity of 1-bit CNNs. Comparative experimental results on model pruning also demonstrate the superiority of our BONNs [287] over existing pruning methods.
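To make the loss concrete, the following sketch evaluates Eq. 3.107 for a single filter, under the simplifying assumption that $\Psi$ is diagonal over the flattened filter entries; the function name `bayesian_pruning_loss` and its arguments are hypothetical and only illustrate the formula.

```python
import torch

def bayesian_pruning_loss(K, K_bar, psi_diag, nu):
    """Per-filter Bayesian pruning loss of Eq. (3.107), assuming a diagonal
    covariance Psi whose (positive) diagonal is given by psi_diag."""
    diff = (K - K_bar).reshape(-1)                  # epsilon = K - K_bar
    l2_term = diff.pow(2).sum()                     # ||K - K_bar||_2^2
    mahalanobis = (diff.pow(2) / psi_diag).sum()    # (K - K_bar)^T Psi^{-1} (K - K_bar)
    log_det = torch.log(psi_diag).sum()             # log det(Psi) for a diagonal Psi
    return l2_term + nu * (mahalanobis + log_det)
```

Summing this quantity over the filters of each group, with $\bar{K}$ taken as the group mean obtained above, gives that group's contribution to the Bayesian pruning loss.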

3.7.4 BONNs

We employ the three Bayesian losses to optimize 1-bit CNNs, yielding our Bayesian Optimized 1-bit CNNs (BONNs). To do this, we reformulate the first two Bayesian losses